
Perceptual Model




Neural Information Processing Systems

The complexity of abducing candidates could be exponential in the worst case. A.4.2 Neural Network & Hyperparameters. All compared methods share the same neural network structure for the same dataset. The default hyperparameters performed poorly in the experiments, so we fine-tuned them, using a learning rate of 10^-3 and a search step of 50. There are eight kinds of sentencing-element labels: recidivism, confession, surrender, juvenile, forgiveness, no loss, pickpocket, and burglary.
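As a concrete illustration (not code from the paper), the snippet below sketches how these reported settings might look as a training configuration; the identifiers and the placeholder network are purely hypothetical.

```python
# Hypothetical configuration mirroring the reported settings above;
# names and the placeholder network are illustrative, not the authors' code.
import torch

LEARNING_RATE = 1e-3  # "learning rate of 10^-3"
SEARCH_STEP = 50      # "search step of 50" (assumed: abduction search budget)

SENTENCING_LABELS = [
    "recidivism", "confession", "surrender", "juvenile",
    "forgiveness", "no loss", "pickpocket", "burglary",
]

# Placeholder classifier over the eight sentencing-element labels.
model = torch.nn.Linear(128, len(SENTENCING_LABELS))
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
```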


A Loss Function for Generative Neural Networks Based on Watson's Perceptual Model

Neural Information Processing Systems

Training Variational Autoencoders (VAEs) to generate realistic imagery requires a loss function that reflects human perception of image similarity. We propose such a loss function based on Watson's perceptual model, which computes a weighted distance in frequency space and accounts for luminance and contrast masking. We extend the model to color images, increase its robustness to translation by using the Fourier transform, remove artifacts caused by splitting the image into blocks, and make it differentiable. In experiments, VAEs trained with the new loss function generated realistic, high-quality image samples. Compared to the Euclidean distance and the Structural Similarity Index, the images were less blurry; compared to deep-neural-network-based losses, the new approach required fewer computational resources and generated images with fewer artifacts.
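To make the idea concrete, here is a minimal, hedged sketch of a Watson-style frequency-weighted distance (grayscale images, non-overlapping 8x8 DCT blocks, a fixed sensitivity table, p-norm pooling). The actual paper additionally handles color, uses a Fourier-transform variant for translation robustness, removes blocking artifacts, and makes the thresholds depend on luminance and contrast masking; here they are a fixed input for brevity.

```python
# Simplified sketch of a Watson-style perceptual distance; the weights
# argument stands in for the model's per-frequency visibility thresholds.
import math
import torch
import torch.nn.functional as F

def _dct_basis(n=8):
    # Orthonormal DCT-II basis matrix (n x n).
    k = torch.arange(n, dtype=torch.float32)
    basis = torch.cos(math.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    basis[0] /= math.sqrt(2)
    return basis * math.sqrt(2 / n)

def watson_style_distance(x, y, weights, p=4.0, eps=1e-8):
    """x, y: (B, 1, H, W) images, H and W divisible by 8.
    weights: (8, 8) perceptual sensitivity thresholds (assumed given)."""
    def blocks(img):
        # Non-overlapping 8x8 blocks: (B, n_blocks, 8, 8).
        u = F.unfold(img, kernel_size=8, stride=8)   # (B, 64, n_blocks)
        return u.transpose(1, 2).reshape(img.shape[0], -1, 8, 8)

    basis = _dct_basis(8)
    dct2 = lambda b: basis @ b @ basis.T             # 2D DCT per block
    err = dct2(blocks(x)) - dct2(blocks(y))
    # Scale the error in each frequency by its visibility threshold,
    # then pool with a p-norm over frequencies and blocks.
    d = (err.abs() / (weights + eps)) ** p
    return d.sum(dim=(1, 2, 3)) ** (1.0 / p)
```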



Adversarially Robust CLIP Models Can Induce Better (Robust) Perceptual Metrics

Croce, Francesco, Schlarmann, Christian, Singh, Naman Deep, Hein, Matthias

arXiv.org Artificial Intelligence

Measuring perceptual similarity is a key tool in computer vision. In recent years, perceptual metrics based on features extracted from neural networks with large and diverse training sets, e.g. CLIP, have become popular. At the same time, such metrics are not adversarially robust. In this paper we show that adversarially robust CLIP models, called R-CLIP_F, obtained by unsupervised adversarial fine-tuning, induce a better and adversarially robust perceptual metric that outperforms existing metrics in a zero-shot setting, and further matches the performance of state-of-the-art metrics while remaining robust after fine-tuning. Moreover, our perceptual metric achieves strong performance on related tasks such as robust image-to-image retrieval, which becomes especially relevant when applied to "Not Safe for Work" (NSFW) content detection and dataset filtering. While standard perceptual metrics can be easily attacked by a small perturbation that completely degrades NSFW detection, our robust perceptual metric maintains high accuracy under attack while performing similarly on unperturbed images. Finally, perceptual metrics induced by robust CLIP models are more interpretable: feature inversion can show which images are considered similar, while text inversion can find which images are associated with a given prompt. This also allows us to visualize the very rich visual concepts learned by a CLIP model, including memorized persons, paintings, and complex queries.
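As a rough sketch of how such a CLIP-feature perceptual distance can be computed, the snippet below uses a standard OpenCLIP checkpoint; the paper's R-CLIP_F weights are not loaded here, but a robust fine-tuned checkpoint would simply be swapped in, and the choice of cosine distance is an assumption.

```python
# Sketch of a CLIP-feature perceptual distance between two images.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
model.eval()

@torch.no_grad()
def clip_perceptual_distance(img_a, img_b):
    """img_a, img_b: PIL images. Returns 1 - cosine similarity of features."""
    batch = torch.stack([preprocess(img_a), preprocess(img_b)])
    feats = model.encode_image(batch)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize
    return 1.0 - (feats[0] * feats[1]).sum()
```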


Review for NeurIPS paper: A Loss Function for Generative Neural Networks Based on Watson's Perceptual Model

Neural Information Processing Systems

Weaknesses: I have one critical concern with this paper, which is that the model proposed here is extremely similar to one result from "A General and Adaptive Robust Loss Function", Jonathan T. Barron, CVPR, 2019. Section 3.1 of that paper (going by the arXiv version) has results on improving reconstruction/sampling quality from VAEs by using a loss on DCT coefficients of YUV images, very similar to what is done here. They also propose a loss with a heavy-tailed distribution that looks a lot like Equation 8 of this submission, and present a method where they optimize over the scale of the loss imposed on each DCT coefficient (similar to this submission). And the improvement in sample/reconstruction quality they demonstrate looks a lot like what is shown in this submission. Given these overwhelming similarities, I'm unable to support the acceptance of this paper without a comparison to the approach presented in that work.
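For reference, the heavy-tailed general loss from Barron's paper that the review points to has a closed form; a small sketch (valid for alpha outside the singular cases 0 and 2, which the published implementation handles separately) might look like:

```python
# Barron's general robust loss, rho(x, alpha, c), as published (CVPR 2019).
import torch

def barron_loss(x, alpha, c):
    """rho(x, a, c) = (|a-2|/a) * (((x/c)^2 / |a-2| + 1)^(a/2) - 1).
    Limits: a = 2 gives 0.5*(x/c)^2; a -> 0 gives log(0.5*(x/c)^2 + 1);
    smaller (more negative) a makes the tails heavier, i.e. more robust."""
    a = torch.as_tensor(float(alpha))
    sq = (x / c) ** 2
    b = torch.abs(a - 2.0)
    return (b / a) * ((sq / b + 1.0) ** (a / 2.0) - 1.0)
```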


Review for NeurIPS paper: A Loss Function for Generative Neural Networks Based on Watson's Perceptual Model

Neural Information Processing Systems

Three knowledgeable referees support acceptance, and I also recommend acceptance. The key contribution of this submission is a new reconstruction loss for VAEs (somewhat like a JPEG loss) that matches human perception more closely than traditional VAE reconstruction losses (e.g., the pixel-wise Euclidean distance). For applications where the goal is to generate sharp images rather than to maximize the likelihood of held-out data, the proposed method is a good alternative to other known ways of generating sharp images with VAEs (i.e., autoregressive/flow-based decoders and adversarial loss functions). Unlike these alternatives, the proposed method introduces few additional parameters to learn from the data. R1's and R2's concern about the lack of quantitative measures of performance is justified, but the author response also makes a compelling point about the difficulty of picking a fair quantitative metric.


Perceptually Optimized Super Resolution

Karpenko, Volodymyr, Tariq, Taimoor, Condor, Jorge, Didyk, Piotr

arXiv.org Artificial Intelligence

Modern deep-learning-based super-resolution techniques process images and videos independently of the underlying content and viewing conditions. However, the sensitivity of the human visual system to image details changes depending on content characteristics such as spatial frequency, luminance, color, contrast, and motion. This observation suggests that computational resources spent on up-sampling visual content may be wasted whenever a viewer cannot resolve the results. Motivated by this observation, we propose a perceptually inspired, architecture-agnostic approach for controlling the visual quality and efficiency of super-resolution techniques. Its core is a perceptual model that dynamically guides super-resolution methods according to human sensitivity to image details. Our technique leverages the limitations of the human visual system to improve the efficiency of super-resolution techniques by focusing computational resources on perceptually important regions, judged on the basis of factors such as adapting luminance, contrast, spatial frequency, motion, and viewing conditions. We demonstrate the application of our model in combination with network branching and network complexity reduction to improve the computational efficiency of super-resolution methods without visible quality loss. Quantitative and qualitative evaluations, including user studies, demonstrate the effectiveness of our approach in reducing FLOPs by factors of 2x and greater, without sacrificing perceived quality.
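To illustrate the kind of sensitivity-based gating described (not the paper's actual model, which also accounts for adaptation luminance, motion, and the viewing setup), one could weight image frequencies by a classic contrast sensitivity function such as Mannos-Sakrison; regions with little perceptually weighted detail could then be routed to a cheaper up-sampling branch.

```python
# Hypothetical sketch: a perceptually weighted detail map built from the
# Mannos-Sakrison CSF; pixels_per_degree encodes the viewing conditions.
import numpy as np

def csf_mannos_sakrison(f_cpd):
    """Contrast sensitivity at spatial frequency f (cycles/degree)."""
    return 2.6 * (0.0192 + 0.114 * f_cpd) * np.exp(-(0.114 * f_cpd) ** 1.1)

def importance_map(img_gray, pixels_per_degree=60.0):
    """Weight each frequency of the image spectrum by visual sensitivity."""
    h, w = img_gray.shape
    fy = np.fft.fftfreq(h) * pixels_per_degree   # cycles/degree, vertical
    fx = np.fft.fftfreq(w) * pixels_per_degree   # cycles/degree, horizontal
    f = np.hypot(*np.meshgrid(fy, fx, indexing="ij"))
    weighted = np.fft.fft2(img_gray) * csf_mannos_sakrison(f)
    return np.abs(np.fft.ifft2(weighted))        # perceptually weighted detail
```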



Self-supervised Spatio-Temporal Graph Mask-Passing Attention Network for Perceptual Importance Prediction of Multi-point Tactility

He, Dazhong, Liu, Qian

arXiv.org Artificial Intelligence

While visual and auditory information is prevalent in modern multimedia systems, haptic interaction, e.g., tactile and kinesthetic interaction, provides a unique form of human perception. However, multimedia technology for contact interaction is less mature than its non-contact counterparts and requires further development. Specialized haptic media technologies with low latency and bitrates are essential to enable haptic interaction, necessitating haptic information compression. Existing vibrotactile signal compression methods, based on perceptual models, do not consider the characteristics of fused tactile perception at multiple spatially distributed interaction points. In fact, differences in tactile perceptual importance are not limited to the conventional frequency and time domains, but also encompass differences in spatial location on the skin, which are unique to tactile perception. For the most frequently used tactile information, vibrotactile texture perception, we have developed a model that predicts its perceptual importance at multiple points, based on self-supervised learning and a Spatio-Temporal Graph Neural Network. Current experimental results indicate that this model can effectively predict the perceptual importance of individual points in multi-point tactile perception scenarios.
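A minimal sketch of the spatio-temporal attention idea over multiple contact points might look as follows; the layer sizes, the boolean graph mask, and the per-point importance head are all assumptions for illustration, not the authors' architecture.

```python
# Sketch: attention across contact points (spatial, optionally masked by a
# graph), then across time, producing a perceptual-importance score per
# contact point and time step.
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)  # per-point perceptual importance

    def forward(self, x, spatial_mask=None):
        """x: (batch, time, points, d_model) vibrotactile features.
        spatial_mask: (points, points) bool mask; True blocks attention."""
        b, t, p, d = x.shape
        # Attend across contact points within each time step.
        xs = x.reshape(b * t, p, d)
        xs, _ = self.spatial(xs, xs, xs, attn_mask=spatial_mask)
        # Attend across time for each contact point.
        xt = xs.reshape(b, t, p, d).permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt, _ = self.temporal(xt, xt, xt)
        out = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
        return self.head(out).squeeze(-1)  # (batch, time, points)
```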